Description of change
Some taps, such as the DynamoDB tap (which uses DynamoDB Streams), may generate duplicate records. To solve this, a query runs before the load into BigQuery (BQ) and removes all duplicates, based on an attribute (typically a timestamp).
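The query itself is not shown in this description. As a rough sketch of one way such a deduplication query could be built, the helper below ranks rows with BigQuery's `ROW_NUMBER()` and keeps the first row per key. The function name `build_dedup_query`, the table name, and the `key_columns` partitioning are assumptions made for illustration and are not taken from this PR.

```python
def build_dedup_query(table, key_columns, dedup_property, order="DESC"):
    """Return a BigQuery query that keeps one row per key.

    Hypothetical helper: rows sharing the same key_columns are treated as
    duplicates, and the first row by dedup_property in the given order is kept.
    """
    order = order.upper()
    if order not in ("ASC", "DESC"):
        raise ValueError(f"deduplication_order must be ASC or DESC, got {order!r}")
    partition = ", ".join(f"`{column}`" for column in key_columns)
    return (
        "SELECT * EXCEPT(row_num) FROM ("
        f"SELECT *, ROW_NUMBER() OVER (PARTITION BY {partition} "
        f"ORDER BY `{dedup_property}` {order}) AS row_num "
        f"FROM `{table}`"
        ") WHERE row_num = 1"
    )


# Table and column names here are placeholders, not values from the PR.
print(build_dedup_query("my_dataset.my_table", ["id"], "date"))
```

With `ROW_NUMBER()`, rows that tie on the ordering column are numbered in an unspecified order, which would match the "random element" behaviour described for ties.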
Manual QA steps
To run the deduplication, the following attributes must be specified as environment variables:
deduplication_property: the attribute to deduplicate on (typically a timestamp). When duplicates exist, the query keeps, by default, the record with the largest deduplication_property, or an arbitrary one among those tied for the largest value. The order can also be changed, for example to keep the record with the smallest value instead.
deduplication_order: the order applied to deduplication_property, DESC by default or ASC.
Additional info
For example, let's say we have the following data coming from the tap:
In this case, with date as the deduplication_property and the default deduplication_order (DESC), we would have:
If we set the deduplication_order to ASC, we would have:
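The effect of the two orders can also be sketched in plain Python. The rows, the `id` key, and the `deduplicate` helper below are hypothetical and only illustrate the keep-largest versus keep-smallest behaviour; they are not the data or the query from this PR.

```python
import os


def deduplicate(rows, key, dedup_property, order="DESC"):
    """Keep one row per key: largest dedup_property for DESC, smallest for ASC."""
    keep_largest = order.upper() == "DESC"
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None:
            best[row[key]] = row
        elif keep_largest and row[dedup_property] > current[dedup_property]:
            best[row[key]] = row
        elif not keep_largest and row[dedup_property] < current[dedup_property]:
            best[row[key]] = row
    return list(best.values())


# Made-up rows: id 1 appears twice with different dates.
rows = [
    {"id": 1, "date": "2021-01-01", "value": "a"},
    {"id": 1, "date": "2021-01-03", "value": "b"},
    {"id": 2, "date": "2021-01-02", "value": "c"},
]

# Read the settings from the environment, falling back to the defaults
# described above (date as the property here is an assumption).
dedup_property = os.environ.get("deduplication_property", "date")
order = os.environ.get("deduplication_order", "DESC")
print(deduplicate(rows, "id", dedup_property, order))
```

In this sketch a tie keeps the first row seen, which is one arbitrary choice among the tied rows; ISO-formatted date strings are used so that string comparison matches chronological order.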